Cluster Hypothesis in Low-Cost IR Evaluation with Different Document Representations
Authors
Abstract
Offline evaluation for information retrieval aims to compare the performance of retrieval systems based on relevance judgments for a set of test queries. Since manual judgments are expensive, selective labeling has been developed to label documents semi-automatically by exploiting the similarity relationships among retrieved documents. Intuitively, the degree of agreement with the cluster hypothesis can directly determine how many manual judgments can be saved by creating labels with a semi-automatic method. Representing documents, however, loses certain information. We argue that a better document representation can lead to better agreement with the cluster hypothesis. To this end, we investigate different document representations on established benchmarks in the context of low-cost evaluation, showing that the representations vary in how well they capture document similarity relative to a query.
Similar resources
Personal Name Resolution of Web People Search
Disambiguating personal names in a set of documents (such as a set of web pages returned in response to a person name) is a challenging task. In this paper, we explore the extent to which the “cluster hypothesis” holds for this task (i.e., that similar documents tend to represent the same person). We explore two clustering techniques that use either (1) term based matching (sing...
Document Clustering Algorithms, Representations and Evaluation for Information Retrieval
Digital collections of data continue to grow exponentially as the information age continues to infiltrate every aspect of society. These sources of data take many different forms such as unstructured text on the world wide web, sensor data, images, video, sound, results of scientific experiments and customer profiles for marketing. Clustering is an unsupervised learning approach that groups sim...
Testing the cluster hypothesis in distributed information retrieval
How to merge and organise query results retrieved from different resources is one of the key issues in distributed information retrieval. Some previous research and experiments suggest that cluster-based document browsing is more effective than a single merged list. Cluster-based retrieval results presentation is based on the cluster hypothesis, which states that documents that cluster together...
Document Clustering: Before and After the Singular Value Decomposition
Document Clustering is an issue of measuring similarity between documents and grouping similar documents together. Information Retrieval (IR) is an issue of comparing query with a collection of documents to locate a set of documents relevant to a particular query. In the vector space IR model, a query is treated as a document which consists of a few terms. Therefore, in both clustering and retr...
Semi-Structured Document Classification
INTRODUCTION Document classification has developed over the last ten years, using techniques originating from the pattern recognition and machine learning communities. All these methods operate on flat text representations where word occurrences are considered independent. The recent paper (Sebastiani, 2002) gives a very good survey on textual document classification. With the development of st...